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Abstract 

Background: The problem of approximate string 
matching is important in many different areas such 
as computational biology, text processing and pat- 
tern recognition. A great effort has been made 
to design efficient algorithms addressing several 
variants of the problem, including comparison of 
two strings, approximate pattern identification in 
a string or calculation of the longest common sub- 
sequence that two strings share. 
Results: We designed an output sensitive algo- 
rithm solving the edit distance problem between 
two strings of lengths n and m respectively in time 
0{{s — \n — m\)min{m, n^ s) -\- m -\- n) and linear 
space, where s is the edit distance between the 
two strings. This worst-case time bound sets the 
quadratic factor of the algorithm independent of 
the longest string length and improves existing 
theoretical bounds for this problem. The imple- 
mentation of our algorithm excels also in practice, 
especially in cases where the two strings compared 
differ significantly in length. 

Conclusions: We have provided the design, anal- 
ysis and implementation of a new algorithm 
for calculating the edit distance of two strings 
with both theoretical and practical implications. 
Source code of our algorithm is available at 
http://www.cs.miami.edu/~dimitris/edit_distance 



Background 

Approximate string matching is a fundamental, 
challenging problem in Computer Science, often re- 
quiring a large amount of computational resources. 
It finds applications in different areas such as com- 
putational biology, text processing, pattern recog- 
nition and signal processing. For these reasons. 



fast practical algorithms for approximate string 
matching are high in demand. There are several 
variants of the approximate string matching prob- 
lem, including the problem of finding a pattern in 
a text allowing a limited number of errors and the 
problem of finding the number of edit operations 
that can transform one string to another. We are 
interested in the latter form in this paper. 

The edit distance D{A, B) between two strings 
A and B is defined in general as the minimum cost 
of any sequence of edit operations that edits A into 
B or vice versa. In this work we will focus on the 
Levenshtein edit distance [1], where the allowed 
edit operations are insertion, deletion or substitu- 
tion of a single character, with each operation car- 
rying a cost of 1. This type of operation is often 
called unit-cost edit distance and is considered the 
most common form. The weighted edit distance 
allows the same operations as the Levenshtein edit 
distance, but each operation may have an arbitrary 
cost. 

In the literature there exist a number of al- 
gorithms dealing with the calculation of the edit 
distance between two strings. The basic dynamic 
programming algorithm that solves the problem in 
0{mn) time and linear space has been invented 
and analyzed several times in different contexts 
[2-7], pubhshed between 1968 and 1975. Early 
on there was an algorithm by Masek and Pater- 
son [8], building on a technique called the "Four- 
Russian paradigm" [9], which computes the edit 
distance of two strings over a finite alphabet in 
time 0{mn/ log n). This algorithm is not appli- 
cable in practice, since it can outperform the ba- 
sic algorithm only then the input size is exceeding 
40GB. All these algorithms can also be used to 
calculate the alignment of two strings, in addition 
to their edit distance. A modification of the basic 
algorithm by Hirschberg [10] allows the alignment 
calculation to be performed using linear space as 
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Figure 1: Dependency graph 



well. 

A few years later in 1985, Ukkonen arrived at 
an 0{s ■ minim, n)) time algorithm, using space 
0{m,in{m,,n,s)) [11], where s is the edit distance 
of the two strings compared, creating a very effi- 
cient output sensitive algorithm for this problem. 
The following year, Myers published an algorithm 
for the Longest Common Substring (LCS) problem, 
which is similar to the edit distance problem, which 
has 0(s^ + (?n+n) log(TO-l-n)) time and linear space 
complexity [12]. In achieving this result, a general- 
ized suffix tree of the input strings, supplemented 
by Lowest Common Ancestor (LCA) information, 
has to be used, which renders the solution imprac- 
tical and only of theoretical value. The practical 
version of that algorithm needs 0{s{m -\- n)) time. 
From the other hand, a variation of Ukkonen's al- 
gorithm using 0{s ■ min{s,m,n)) space leads to 
an efficient, straightforward implementation, using 
recursion. Lastly, the basic algorithm, although 
theoretically inferior, is the most commonly used, 
owing to its adaptability, ease of implementation, 
instruction value, and speed, the latter being a re- 
sult of low constant operations. 

In this paper we will present anO((s — |n — ?7i|)- 
min(m, n, s) + m + n) time and linear space algo- 
rithm to calculate the edit distance of two strings, 
which improves on all previous results, the imple- 
mentation of which is practical and competitive 
to the fastest algorithms available. The quadratic 
factor in the time complexity now becomes inde- 
pendent of the longest string, with the algorithm 
performing its best when the two strings compared 
differ significantly in size. 



Methods 
Definitions 

In this section we closely follow the notation and 
definitions in Rcf. [11]. Let A = aia2...a„ and 
B — bib2 ■ ■ ■ bm he two strings of lengths n and m 



respectively, over a finite alphabet S. Without loss 
of generality, let n > m. 

To define the edit distance between two strings 
A and B, we will let the possible editing opera- 
tions be deletion, insertion and substitution. Then 
we define edit distance as the minimum number of 
character insertions, deletions and substitutions to 
transform string A to string B. By that definition, 
each editing operation has a cost of 1 and this edit 
distance in bibliography is usually referred to as 
Levenshtein edit distance [1]. Edit operations can 
be generalized to have non-negative costs, but for 
the sake of simplicity in the analysis of our algo- 
rithm we will concern ourselves only with the Lev- 
enshtein edit distance. We also assume that there 
is always an editing sequence with cost D{A, B) 
converting A into B such that if an cell is deleted, 
inserted or changed, it is not modified again. Un- 
der these assumptions the edit distance is symmet- 
ric and it holds < s < max{n, m). Since n > m 
and there is a minimum number of n — m indcls 
that need to be applied in transforming A into B, 
the last equation becomes n — m < s < n. The 
insertion and deletion operations are symmetric, 
since an insertion, when transforming A to B, is 
equivalent to a deletion in the opposite transfor- 
mation, and vice versa. Both operations will be 
referred to as indels. 

The basic dynamic programming algorithm em- 
ployed to solve the edit distance problem, invented 
in a number of different contexts [2-7] , makes use 
of the edit graph, an (n + 1) x (m + 1) matrix (dij) 
that is computed from the recurrence: 

doo = 

dij = iiiin( di-i,j-i + (If tti = bj then else 1) 

di-i,j + 1, 

di,j-i + 1), i> or j > 0. 

This matrix can be evaluated starting from doo 
and proceeding row-by-row or column-by-column. 
This process takes time and space 0{mn) and pro- 



duces the edit distance of the strings in position 
dmn- The cehs of the matrix (nodes of the graph) 
have dependencies based on this recurrence, form- 
ing the dependency or edit graph, a directed acyclic 
graph that is shown in Fig[TJ All edit graph nodes 
will be referred to as cells and all graph edges (edit 
operations) will be referred to as transitions. 

To refer to the diagonals of (dy) we number 
them with integers — m, — m-t-l, . . . , 0, 1, . . . , n such 
that the diagonal denoted by k consists of those dy 
cells for which j — i = k. The diagonal n—m, where 
the final value d„in resides, is special for our pur- 
poses and we will call it main diagonal. The matrix 
cells between diagonals and n~m (inclusive) con- 
sist the center of the edit graph/matrix, the lower 
left triangle between diagonals —1 to —m will be 
called the left corner of the graph and upper right 
triangle between diagonals n — m+l and n will be 
called the right corner of the graph. 

A path in the edit graph is a series of transitions 
connecting cells, similar to a path in a directed 
graph. Whenever we generally refer to a path, we 
will assume that the final cell it reaches is dmn- 
The optimal path will be a path originating at dpo, 
and for which the sum of the costs of its transitions 
is minimal among all paths from doQ. 



The concept 

The basic dynamic programming algorithm evalu- 
ates unnecessary values dij. This observation led 
Ukkonen [11] design an algorithm that is diagonal- 
based and computes cell values only between the 
diagonals — s and n — m + s. He also used the 
observation that, under the edit operations' cost 
scheme discussed previously, the values in a di- 
agonal are monotonically increasing, where for 
Levenshtein edit distance costs it is furthermore 
di+ij+i £ {dij,di,j + 1}. 

Both Ukkonen [11], for calculating the edit dis- 
tance, and Myers [12], for calculating the length 
of the Longest Common Substring (LCS) of two 
strings, a problem closely related to the edit dis- 
tance, designed their algorithms with a common 
feature: The iterations in evaluating the edit graph 
cells were score based, as opposed to column or row 
based in the basic algorithm. In each step they 
would increase the edit distance -D by 1, starting 
at 0, and evaluate all cells with values dij < D, 
meaning cells reachable with edit distance D, often 
omitting cells not contributing to the next itera- 
tion, by considering transitions between cells where 
the values are incremented. 

The algorithm we present here builds on all pre- 
vious observations and the main iteration is score 
based as well. But we also make use of the follow- 
ing facts: 



1. A number of indels (n — to) is unavoidable 
when the two strings considered differ in size. 

2. Additional indels arc unavoidable when the 
optimal path strays away from the main di- 
agonal. 

3. Certain cells do not contribute to the optimal 
path or their contribution is redundant. 

The first point is obvious, since the optimal 
path, starting at diagonal and ending at diago- 
nal n — m, can only use indels to progress through 
diagonals. To argue towards the second fact we 
notice that every time a path follows an indel from 
diagonal fc to fc — 1 when k < n~ m or from diago- 
nal fc to fc -|- 1 when k > n — m, that path will need 
to follow another compensatory indel at some later 
point, in order to reach the main diagonal, where 
the target cell dmn resides. 

In order to address the third fact, we will intro- 
duce the concept of dominance. We will say that 
cell dij dominates cell dki if no path through dki de- 
fines a better edit distance than the optimal path 
through dij. This implies that dij has an equal 
or better potential to belong to the optimal path 
(which defines s) than dki , and thus the latter and 
its paths do not need to be considered further. 

Some dominance relations between cells can be 
spotted easily. Let us consider all possible paths 
starting from doo ■ If a match exists between char- 
acters ai and bi (ai = 6i), then we do not need to 
consider indel transitions from doo to dio and doi- 
In that case actually, all cells dofe for 1 < fc < n and 
dko for 1 < fc < TO are dominated by du. Since oi 
matches bi, cell du obtains the value of 0. Then 
all cells dik, 2 < k < n can obtain a value of fc — 1 
through a path traversing du. Any path through 
doi cannot result in a smaller value for cells dik, 
2 < k < n, since cells do,fe-i have the same value. 
In a similar manner, cells in the second column 
starting at the third line are dominated by du. 
These arguments apply not only to dpo but to all 
dij in general, proving the following: 

Lemma 1. A cell dij is dominated by d^+ij+i if 



Let us now consider what happens when ai ^ 
bi. In this case we can still find dominated cells 
in the second row and column, depending on the 
first matching character position in each. Let us 
assume that the first character in A matching bi 
is ai, 2 < I < n. Ah cells dik, 2 < k < I - 1 
are dominated by du, for the same reasons that 
were described earlier. And a similar domination 
relation exists in the columns. 

Before we generalize the dominance relation 
through a theorem, we will introduce a new scor- 
ing scheme to take advantage of the indel unavoid- 
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Figure 2: Edit graph cell values and optimal paths under different scoring schemes 



ability, which will create another optimization cri- 
terion, monotonicity in the rows and columns of 
certain parts in our graph. For the new scoring 
scheme and for the rest of the description of our al- 
gorithm, we will divide our matrix into two parts, 
separated by the main diagonal. The first part in- 
cludes the center and the left corner of the matrix, 
where the second part includes the right corner of 
the matrix, together with the main diagonal (which 
is shared by both parts). The scoring scheme and 
the algorithm described further on will be analyzed 
on the part of the matrix left of the main diago- 
nal, although all theory works symmetrically on 
the part right of the main diagonal, by substitut- 
ing the rows with columns and vice versa. 

The new scoring scheme, for the left part of 
the matrix, is implemented as follows: Every ver- 
tical transition (indel) incurs a cost of 2, since it 
strays away from the main diagonal and creates 
the need of another horizontal indel to compen- 
sate. All horizontal transitions do not carry any 
cost. The match and substitution costs remain 
and 1 respectively. To obtain the edit distance s, 
we add n — m to the value of cell d^n . The trans- 
formation is illustrated through an example in Fig. 

m 

To guarantee the correctness of an algorithm 
based on that scoring scheme, we will now prove 
the following lemma: 

Lemma 2. Under the new scoring scheme, the edit 
distance of A and B remains unchanged. 

Proof. It has already been shown that the edit dis- 
tance is defined by an optimal path of the fewest 
possible edit operations carrying a cost, resulting 
in the minimum score at dmn- We will prove the 
following two statements: 

1. The score obtained from the optimal path re- 
mains unchanged and 

2. No other path can lead to a sequence of fewer 
edit operations and thus a smaller score / 
edit distance. 



To prove the first statement, we note the following: 
The number of match and substitution transitions 
in the optimal path do not alter the edit distance 
in the new scoring scheme, since the costs of these 
operations have not changed. With the optimal 
path starting at diagonal and ending at diagonal 
n~m, there are n — m indels which can be omitted 
from our calculation, since with the new scoring 
scheme we add these at the end. The only remain- 
ing edit operations to examine are vertical indels 
left of the main diagonal and horizontal indels right 
of the main diagonal, which must be accompanied 
by compensatory horizontal and vertical indels in 
the respective parts, or the optimal path cannot 
end up in the main diagonal. Since these indels 
come in pairs, with half of them carrying the cost of 
2 and half the cost of in the new scoring scheme, 
the final edit distance remains unchanged. 

The second statement follows from the previous 
arguments, since any path under the new scoring 
scheme carries the same cost as before, so a new 
path with a better score than the previous optimal 
path score contradicts the optimality of the latter 
under the original scoring scheme. D 

Since with the new scoring scheme horizontal 
transitions do not carry a cost, the values of cells 
in every row in the left part of the matrix are 
monotonically decreasing. The same holds for the 
columns in the right part of the matrix, which leads 
to the following: 

Corollary 1. Under the new scoring scheme, the 
values of cells in rows left of the main diagonal and 
in columns right of the main diagonal are monoton- 
ically decreasing as the indices of the corresponding 
cells increase. 

Let us now consider all cells in a specific row 
X, left of the main diagonal. Values on this row 
are monotonically decreasing and we only need to 
keep the information of the first cells from the right 
where the values are changing (the leftmost cells of 
a series of cells with the same value) , since the rest 



of the cells are dominated (can be reached with 
cost from the aforementioned cells). Now, if we 
have two consecutive dominant cells dxy and dxz 
on row X, with y < z and d^y = d^z + 1, then the 
value of dxy can be propagated through a transi- 
tion to row a; + 1 only if a match exists between 
bx and Uk, with y < k < z. In order to be able to 
locate such matches in constant time, we will cre- 
ate lookahead tables for each letter of the alphabet 
E, which can point to the next matching character 
from strings A and B. Basically these lookahead 
tables will be able to answer the question: Given a 
character c S S and a position 1 < fc < n, what is 
the smallest index I > k such that a; — c? And the 
same for string B. Such a lookahead table can be 
easily constructed in time and space 0{{n+m)\^\), 
which for a fixed alphabet of constant size is linear, 
by traversing both strings in reverse order, once for 
each character of the alphabet. 

One can easily verify that lemma [T] still holds, 
based on the same arguments used to prove it, un- 
der the new scoring scheme. In addition, the fol- 
lowing corollary holds: 

Corollary 2. A cell dij with value D dominates 
all cells di^k.j-k, < fc < max{i,j) with values 
> D. 

Proof. It is easy to see, with a simple inductive ar- 
gument, that a cell dij dominates all parental cells 
on the same diagonal with the same score. Since 
any cell dominates itself with a higher score (be- 
cause every path from that cell will have a higher 
score equal to the difference of the two scores) , the 
corollary follows. D 



The algorithm 

The algorithm works separately on the two parts 
of the matrix left and right of the main diagonal. 
For the description of the algorithm will consider 
only the part of the matrix lying left of the main 
diagonal, with the assumption that all operations 
are symmetric on the right part of the matrix. An 
exception will occur when we describe the interface 
of the two parts. 

Our edit distance algorithm is score based. 
On each iteration the edit distance score is incre- 
mented by 1 and the part of the edit graph that can 
be reached with the current score is determined. 
The initial score is 0, although we should keep in 
mind that, since at the end we add n — m to the 
score - adjusting for the unavoidable indels that 
we get for free on horizontal transitions ~ it can 
be considered as if the score is initialized with the 
value n — m. 

During each iteration, we keep the values and 
positions of the cells we work with in a double 
linked list, which will be referred to simply as list. 



To store the position of a cell we actually need 
only the column index where the cell resides, for 
reasons that will be explained later. The initial- 
ization phase starts with the determination of the 
cells which can be reached with a score of 0. Since 
all horizontal and match diagonal transitions (di- 
agonal transitions corresponding to matching char- 
acters) have a cost of 0, we follow horizontal tran- 
sitions until we locate a match, then advance to 
the next line and repeat. The process ends when 
we reach the main diagonal. We do not need to 
keep information on all cells with value, the first 
cell with a value of on each line suffices, since 
all further cells are dominated. These dominant 
leftmost cells can be located in constant time for 
each line, by using the lookahead tables. When 
we encounter a series of matches on the same di- 
agonal, we only need to keep the value of the last 
(bottom-right) cell, since all other cells are domi- 
nated. The indices of cells accessed through this 
process increase monotonically, as we advance for- 
ward through rows, columns and diagonals. The 
initialization finishes when the main diagonal is 
reached. Thus at the end of the initialization step 
we have a list of cells with value, each of which 
resides on a different row, column and diagonal of 
the matrix. An example of the initialization phase 
can be found in Fig. [3^. 

On each subsequent iteration of the algorithm 
and with each increasing value of the score, the 
linked list is updated with new cells that can be 
reached from members of the list. The algorithm 
at iteration D, with D also being the current score, 
starts from the top of the list and processes one 
cell at a time. For each list cell examined having a 
value of D — 1 or 13 — 2, as will be proved in lemma 
[3l we either follow a substitution transition, if the 
cell's value is Z3 — 1 or a vertical indel transition 
if the cell's value is 13 — 2. Lets assume we are 
examining list cell dij = D — 1. We know that 
di-(_ij_|_i = D, since if rfj+ij+i < D it would al- 
ready be included in the list, unless dominated by 
another cell in the list, which is impossible since 
then dij would in turn be dominated by di+ij+i 
and would not be in the list during the current 
iteration. We now find the largest k for which 
bi+k = cij+k, k >1 and insert cell di+k,j+k in the 
list. That is the last cell in a series of match tran- 
sitions, starting at di+i.j-i-i, if any exist. Next, we 
examine the cells following dij in the list and re- 
move the ones that are dominated by di+k.j+k- At 
this step, list cells dop in rows a < i -\- k and on 
diagonals a — p such that i — j <o~p<n^m 
are removed, all being dominated as proved later 
in theorem [TJ Starting now at cell di-^k,j+k, we 
repeat the process performed in the initialization, 
with the difference that for each new cell inserted 
in the list, all subsequent cells in the list that are 
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(e) Score 4 iteration 

Figure 3: Edit distance algorithm iterations. The main diagonal is depicted in blue, iteration transitions 
are drawn in red and green alternatively. Cells whose values are presented have been inserted in the list 
at the end of each iteration, where cells that their values are circled have been removed from the list, 
dominated by the cells they connect with arcs. 



dominated by the new member are removed. This 
process will stop once the next identified match 
in the lookahead table falls inside the dominated 
area. Precisely, if dop is the last cell with value D 
that was inserted in the list, the next match from 
the lookahead tables resides at diagonal q and the 
next cell in the list resides at a diagonal p < q and 
row r > o, then the process of inserting new cells 
derived from dij is terminated and we proceed to 
the next cell in the list. 

Each iteration finishes once we reach the main 
diagonal. The reader can follow the procedure, 
through the five iterations in calculating the edit 
distance of strings A = 'GATCGCGACC and B = 
'ACTTCTA', in Fig. H 

One special case that was not covered in the 
above description is the handling of a cell insertion 
following a vertical indel transition, when another 
dominated cell on the same diagonal exists in the 
list. In this case, the only position the dominated 
cell can occupy is previous to the current cell ex- 
amined, from which the transition emanated. This 
results in the removal of the dominated cell. This 
special case only requires a constant number of op- 
erations and does not alter the complexity of the 
algorithm. 

As already mentioned, the part of the matrix 
right of the main diagonal is processed in a sym- 
metric way. At the end of each iteration, the cells 
of the main diagonal, which belongs to both parts, 
have to be updated. These cells reside at the end of 
the lists for both parts and the update is performed 
in constant time as well. 

We will now proceed to prove the following the- 
orem: 

Theorem 1. Cell dij on diagonal i~ j with value 
D dominates all cells d^i in the list with k < i, 
i— j<k — l<n~m and values < D, meaning all 
list cells in rows above it and columns with larger 
indices. 

Proof. Since horizontal transitions carry a cost of 
0, all cells in row i and column / with j < I < n — m 
have a score of at most D. All cells dki in the list, 
residing in diagonals k — l with i—j < k — l < n — m 
and in rows k with k < i lead diagonal transitions 
to cells dk+i.i+i with score at most D, since a; ^ bk 
(or dki would not belong to the list, dominated by 
dk+i.i+i)- This implies that no diagonal transi- 
tion from these cells can produce a value smaller 
than D in any cell on row i and column > j via 
a path passing through these cells, since values in 
the paths are monotonically increasing (because all 
edit operations have non-negative costs). If wc now 
examine the vertical transitions emanating the dki 
cells under consideration, they also result in paths 
propagating scores at least D, which again cannot 
result in a better score on the cells on row i and 



column > j . All cells on diagonals < i — j do not 
need to be considered, since they cannot be reached 
from the claimed dominated cells of this theorem, 
unless a path reaches them through a cell in diag- 
onal i — j. But in corollary [2] we showed that cells 
on this diagonal with scores > D are already dom- 
inated by dij . Thus all d^i cells are dominated by 

The next corollary follows from the domination 
theorem [T] 

Corollary 3. No two cells in the list reside on the 
same column. 

Proof. Before a new candidate cell dij is inserted in 
the list, any list cell on the same column will be re- 
moved, since it is dominated by the newly inserted 
cell, based on the previous theorem. D 



Now we have the necessary tools to prove 
following lemma: 



the 



Lemma 3. When iteration D starts, with I < D < 
s — (m — n), all cells in the linked list have either 
a score of D — 1 or D — 2. 

Proof. Initially, after the initialization, the list 
holds cells with value 0, so the lemma holds. Ev- 
ery time a cell is inserted in the list it will remain 
until it is dominated by another cell or the algo- 
rithm terminates. Unless a cell with score D in the 
list is dominated and removed before its transitions 
are examined, when the algorithm reaches that cell 
the diagonal transition emanated from the cell will 
produce the next candidate, with score D + 1, to 
be inserted in the list. The second time this cell 
is visited, the vertical transition from it will be 
examined. In that case, the next candidate with 
score D + 2 will dominate the current cell, accord- 
ing to the previous theorem. Thus, even if a cell is 
not dominated by another inserted cell, it will be 
dominated by its siblings. D 

A direct consequence of the previous lemma is 
the following: 

Corollary 4. At most two cells in the list reside 
in the same diagonal, and their values differ by 1. 
This holds for same row list cells as well. 

A pseudo-code description of the algorithm is 
presented below. The description excludes special 
cases requiring substitutions of the currently ex- 
amined cells of the list and only presents the op- 
erations of the algorithm in the part of the matrix 
left of the main diagonal. The procedure interfac- 
ing the left and right linked lists is omitted as well. 
The code can be studied in more detail from the 
available code. 



Initialize lookahead arrays X 

Initialize linked list L 

score 13 := 

line I := 

column c:— 

while Not reached main diagonal do 

insert dix :=X[a;][c] into L 

c := X 

1 + + 
end while 

while Not reached cell dmn do 
D + + 

Current Cell dij := L ^head 
repeat 

if dij = D — I then 

dij :=processjnext_candidate{di^ij^i) 
else 

dij := process-next_candidate{di+i^j) 
end if 
until dij = L ^head 
end while 

Function process Jeft_candidate(d/c;) 
while ai = bk do 

fc + + 

1 + + 
end while 
Insert dki in list L 

Remove dominated dij — >next by dki from L 
while not reached diagonal of dij — >next do 

processJeft_candidateQi.[ak+i][l + 1]) 
end while 



return di 



*next 



Algorithm complexity 

The algorithm described in the previous section is 
score based and as such the main loop executes an 
equal number of times with the value recorded at 
cell dmn of the edit graph. Since we add the value 
n ~ m to that score in order to obtain the edit 
distance of strings A and B, the total number of 
iterations is equal to s — \n — m\. 

At all times during the execution of the al- 
gorithm the linked list contains at most m cells, 
which is a direct consequence of corollary [3l Also, 
due to corollary m there can be at most 2s cells in 
the list at any given time, since the maximum num- 
ber of diagonals on which the algorithm processes 
cells is s, consisting of the center of the matrix and 
diagonal bands of size (s — \n — in\)/2 from each 
side of the center, accessed while the algorithm it- 
erates. Basically, for every two iterations of the 
algorithm, one further diagonal from each side of 
the center of the matrix is accessed. 

All cells in the list are accessed in order and 
without backtracking during each iteration. Each 



cell undergoes through a constant number of struc- 
tural accesses, once when it is inserted in the list, 
once when it is removed and two times when the 
diagonal and vertical transitions from this cell are 
examined, if there is a chance before it is domi- 
nated. During each iteration there are other cells 
accessed, the candidates for insertion in the list. 
While processing these cells we are advancing both 
the indices of columns and rows without backtrack- 
ing, which proves, as with list cells, that there are 
at most m or s candidate cells examined during 
each iteration. 

A candidate cell may be accessed several times 
while compared to a list cell, in order to deter- 
mine a dominance relation. A list cell can also 
be accessed several times during the same process, 
to check whether it is dominated. The amortized 
cost for each cell though is constant. Every time 
a candidate cell is re-examined, a cell from the list 
has been removed. And every time a list cell is 
re-examined, in the previous step it was not dom- 
inated by a candidate cell, the latter then having 
being inserted in the list and not being examined 
again on that iteration. Since each time we ad- 
vance through either a candidate or a list cell, 
and since both sets have 0{min{m, s)) cells (un- 
der the assumption that m < n), the total number 
of constant time operations during an iteration is 
0{min{m, n, s)). 

This analysis demonstrates that the total run- 
ning time of our algorithm is 0{{s — \n — m\) ■ 
min(m, n, s) + m + n), where the last linear m + n 
component represents the time necessary to initial- 
ize the lookahead tables. It can be easily verified 
using simple algebra that s—\m — n\ < min{m, n), 
which provides another less tight upper bound of 
the worst case time behavior of the algorithm, 
0{min^(m, n, s)+m+n). We can therefore observe 
that the quadratic factor in the time complexity is 
independent of the longest string being compared. 
The space usage of this algorithm is 0{m + n), 
dominated by the size of the lookahead tables kept 
in memory. This completes the proof of the next 
theorem: 

Theorem 2. The edit distance s of two strings 
A and B with lengths n and m respectively can be 
computed in time 0{{s — \n — m\) ■ min{m, n, s) + 
m + n) and in space 0(m + n). 



Results 

We have implemented our new algorithm to test 
its performance in practice. For comparison pur- 
poses, we implemented the basic 0{mn) algorithm, 
also known as Needleman-Wunsch [3], as well as 
the Ukkonen 0{s ■ min{m,n)) algorithm [11]. All 
algorithms were implemented in perl, using the 
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Figure 4: Edit distance calculations on random strings with different length ratios, comparing the per- 
formance of our, Ukkoncn's and the basic algorithms 



same input/output procedures and no optimiza- 
tions. Benchmarking was performed with the 
benchmark perl module for the experiments av- 
eraging a large number of random runs, and the 
time Unix command for individual experiments, the 
same method always used across algorithms. All 
tests were performed on an 8GB RAM 2.93GHz 
Intel processor IBM compatible desktop machine, 
running ubuntu linux. In all test cases the data 
completely fit in the main memory. 

Since perl does not support pointer structures 
efficiently, we implemented the double linked list 
with arrays, using the fact that no two cells in 
the list can reside on the same column. This way 
we access list cells using their column index. As 
such, the list occupies more space than the mini- 
mum possible, where the implementation may have 
been more efficient in another programming lan- 
guage supporting these structures. 

Ukkonen's algorithm implementation was 
based on the outline found in [11] and the more de- 
tailed description found in [13]. The version used 
is particularly simple by making use of recursion, 
but has larger than linear space demands, specif- 
ically 0{s ■ maxim, n)). The basic algorithm was 
implemented using linear space and row-by-row 
iterations. 

The first two experiments were run on ran- 
dom sequences over alphabets of 4 and 20 char- 
acters respectively, similar to random DNA/RNA 
and amino acid sequences. The length of the first 
sequence from the two compared was set at 1000 
characters, where the length of the second sequence 
varied between 1000 and 3000 characters. We ex- 
amined a total of nine length ratios n/m values 
between 1 and 3 (1 < n/m, < 3). For each length 
ratio, 100 different comparisons were run, with the 
execution time and edit distance values averaged 
among these. The results are depicted in Fig. [H 



In these simulations it is worth noticing signif- 
icant performance improvement of the new algo- 
rithm with increasing length ratio of the random 
strings, although the total length of the strings is 
increasing. This is not surprising, since the num- 
ber of iterations s — |m — n| is decreasing, caused 
by a slower increase in edit distance than difference 
between the lengths of the two strings. 

Ukkonen's algorithm is under-performing when 
comparing random strings over a large alphabet, 
because of the large expected edit distance value 
in these cases. This algorithm is designed for com- 
paring similar strings, which is the case most often 
encountered in practice. In contrast, the basic al- 
gorithm, owing to its simplicity, is performing con- 
sistently and surpassing the other algorithms when 
the edit distance is large compared to string length, 
unless when the s — |n — m| value becomes small 
enough, where our algorithm takes the lead. 

Next, we designed computational experiments 
performing comparisons most often encountered in 
practice, drawn from the computational biology 
domain. In all examples the sequence pairs exam- 
ined have comparable lengths, not differing more 
than 5%. The results are presented in Table[T] The 
first simulation involved 1000 random sequence 
pair comparisons from a pool of approximately 
6800 vetted 16S ribosomal RNA sequences, pro- 
vided by the Ribosomal Database Project (RDF) 
at http://rdp.cme.msu.edu. These sequences aver- 
age about 1350 characters in size, drawn from an 
alphabet of size 4. A random pair of 16S rRNA 
sequences from the same genus and another from 
the same class but different order are compared in 
the next two lines, followed by a comparison of two 
viral genomes and two virion proteins. 

As these results demonstrate, the performance 
of our algorithm compares favorably to Ukkonen's 
asymptotically slower but with lower constants al- 



Sequence A 


Sequence B 


Alphabet 

size 


(Average) 
length 


Our Ukkonen's Basic 
algorithm algorithm algorithm 

(sec) (sec) (sec) 


(Average) 

edit 
distance 


Random 16S 
rRNA sequence 


Random 16S 
rRNA sequence 


4 


1350 


0.679 0.811 2.554 


421.3 


Hyphomonas 16S 
rRNA (AF082798) 


Hyphomonas 16S 
rRNA (AF082795) 


4 


1330 


0.25 0.18 2.14 


46 


Alphapioteobacteiia 16S 
rRNA (AJ238567) 


Betaproteobactcria 16S 
rRNA (AJ239278) 


4 


1320 


0.42 0.46 2.07 


318 


Cucumber necrosis 

virus genome 


Lisianthus necrosis 
virus genome 


4 


4790 


6.70 6.32 28.27 


1154 


Human poliovirus 1 
virion protein 


Human Rhinovirus A 
virion protein 


20 


870 


1.02 1.05 0.88 


472 



Table 1 : Algorithm performance comparing biologically related sequences of similar length 



gorithm, while the basic algorithm is outperformed 
in almost every case, except when matches are 
sparse. Performance comes with some cost though 
and it is interesting to note that the size of the 
program implementations of the three algorithms, 
the basic, Ukkonen's and ours, is 80, 160 and 700 
lines of code respectively. 

The perl implementations of all three 
algorithms used in this paper for perfor- 
mance comparisons can be downloaded at 



http://www.cs.miami.edu/~dimitris/edit_distance 



Conclusions 

In this paper we have provided the design, analysis 
and implementation of a new algorithm for calcu- 
lating the edit distance of two strings. This algo- 
rithm is shown to have improved asymptotic time 
behavior, while it is also demonstrated to perform 
very well in practice, especially when the lengths 
of the strings compared differ significantly. The 
performance of our algorithm in this case, which 
is encountered less often in instances of the edit 
distance problem, could find application in the re- 
lated Longest Common Subsequence (LCS) and 
other similar problems solved with dynamic pro- 
gramming techniques. 

Future directions for this algorithm include the 
investigation of further practical applications of the 
techniques described in other similar problems, as 
well as generalizing the results for arbitrary weights 
of the edit operations and/or covering additional 
edit operations such as swaps. 
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